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Linking phenotypic with genotypic diversity has become a major requirement for basic and applied genome-centric bio- 
logical research. To meet this need, a comprehensive database backend for efficiently storing, querying and analyzing large 
experimental data sets is necessary. Chado, a generic, modular, community-based database schema is widely used in the 
biological community to store information associated with genome sequence data. To meet the need to also accommodate 
large-scale phenotyping and genotyping projects, a new Chado module called Natural Diversity has been developed. The 
module strictly adheres to the Chado remit of being generic and ontology driven. The flexibility of the new module is 
demonstrated in its capacity to store any type of experiment that either uses or generates specimens or stock organisms. 
Experiments may be grouped or structured hierarchically, whereas any kind of biological entity can be stored as the 
observed unit, from a specimen to be used in genotyping or phenotyping experiments, to a group of species collected 
in the field that will undergo further lab analysis. We describe details of the Natural Diversity module, including the design 
approach, the relational schema and use cases implemented in several databases. 



Introduction 

In the last 20 years, high-throughput technology develop- 
ments have revolutionized biology and transformed it into 
an information-based science. The first spur in data gener- 
ation occurred in the early 1990's when large-scale sequen- 
cing became available through Sanger technology for 
relatively small-scale projects such as EST and BAC sequencing 
(1,2). In more recent years, the number of nucleic acids and 
protein sequences freely available at data repositories, such 
as GenBank, has grown exponentially as higher throughput 



and relatively inexpensive sequencing and genotyping plat- 
forms have enabled widespread generation of genomic data. 
Although Model-Organism Databases (MODs) were designed 
for storing genomic data and their derived annotations, they 
all face similar challenges in answering the needs of the de- 
veloper and user community, including how to store the data 
efficiently, and how to adapt to newly emerging data types 
and represent them in a meaningful way so that biological 
questions can be answered. 

At the core of the Generic Model Organism Database 
(GMOD) is a generic schema, named Chado, which was 
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initially designed for storing Drosophila data at FlyBase, 
with the vision of creating a reusable and generic open 
source schema (3). Chado is ontology-driven and modular, 
and thus highly flexible. Chado's design principles enable 
the same schema to be used in projects with widely differ- 
ent metadata. Metadata can be modified or added as new 
data types become available. Its modular design allows de- 
velopers to select those parts needed to manage their data, 
and to add new modules when advances in biology require 
new data types. Currently, the Chado schema consists of 18 
modules, covering sequence, phenotype, genotype, ontol- 
ogies, publications and phylogenies (http://gmod.org/wiki/ 
Chado), with 23 genomic databases reporting that they use 
some or all of the Chado modules (http://gmod.org/wiki/ 
GMOD_Users). As a GMOD component Chado is open 
source, and therefore any user can contribute to the 
schema and the underlying code (http://sourceforge.net/ 
projects/gmod/), provided those contributions are consist- 
ent with the Chado generic design principle. 

The initial development of Chado focused on genome 
sequence data, but as more complex data was generated, 
such as microarray and expression data, new modules were 
added to accommodate new data types. Chado has also 
proven useful for handling multiple closely related organ- 
isms. Clade Oriented Databases (CODs) using Chado include 
the Sol Genomics Network [SGN; http://solgenomics.net/ 
(4)]; Genome Database for Rosaceae [GDR; www.rosaceae 
.org (5)]; Citrus Genome Database (www.citrusgenomedb 
.org); Cool Season Food Legume Genome Database (www 
.gabcsfl.org); KnowPulse (http://knowpulse2.usask.ca/ 
portal/) and the Genome Database for Vaccinium (www 
.vaccinium.org). 

Unlike the exponential increase in sequencing data, 
phenotypic data has been growing at a much slower pace. 
Although count, structure and functional annotation for 
genes can be derived in silico using sequence similarity and 
other methods, analysis tools for correlating phenotype with 
genotype fall far behind those for sequence analysis (6,7). 

A problem in phenotyping is the lack of genetic diversity 
in cultivated plants, which underwent heavy selection 
during domestication (8-11), causing a decrease in geno- 
typic variation. As a result, cultivated plants may have as 
little as 5% of the natural diversity found in their wild 
counterparts. With such low allele pools, there is also a 
dramatic decrease in phenotypic variation. Another prob- 
lem is related to difficulties in high-throughput phenotyp- 
ing. For genotyping, one can choose from numerous 
available technologies according to desired quality and 
funds available. The end product is always a molecular se- 
quence, and hence a uniform data type for which there are 
well understood and standard ways for processing and stor- 
age. In contrast, the data collection process for phenotyp- 
ing is slow, expensive and subjective to the person 
collecting the data, generally with no set standards for 



terminology or descriptors to capture phenotype observa- 
tions. Moreover, many phenotypes are subtle or even un- 
detectable to the naked eye. Traits controlled by multiple 
quantitative trait loci may have dozens of underlying 
genes, each having a small additive affect (12). In addition, 
traits may be sensitive to environmental effects and may 
exhibit interactions with the genotype. Because of these 
and other challenges, phenotype data are notoriously dif- 
ficult to record. 

Although new technology, such as automated computer- 
ized facilities for growing plant germplasm (13) and soft- 
ware for taking multiple computerized measurements of a 
specimen (14), may facilitate high-throughput phenotyp- 
ing, the challenge of capturing phenotypic diversity data 
remains complex, expensive and error-prone. 

Despite these difficulties, breeding programs generate 
large volumes of phenotypic data, which poses a challenge 
for databases to efficiently store, query and analyze these 
data. In addition, such programs require genetic informa- 
tion to be integrated with phenotype data for progeny se- 
lection and crossing designs. Large-scale genotyping, 
mostly based on next-generation sequencing-based SNP 
markers, is now routinely performed on the same group 
of individual plants or animals for which phenotype 
assays were done. Large-scale phenotyping and genotyping 
experiments are currently practiced in various projects 
(15-18) as well as in applied breeding experiments. 
Shared among these projects is the challenge of determin- 
ing how to best manage these data. 

In maize, three monocot databases, Panzea (19), 
Gramene (20) and GrainGenes (21), jointly developed the 
Genomic Diversity and Phenotype Data Model (GDPDM) to 
capture molecular and phenotypic diversity data. The core 
schema of GDPDM consists of tables for germplasm, pheno- 
type, genotype and environment that capture associations 
between phenotypes and genotypes. Although GDPDM 
performs well for its creators, the genericity of the 
schema is limited, and its design deviates enough from 
Chado's principles to make it difficult to adapt to other 
modules in Chado. As model organism and clade-oriented 
databases, which were already using Chado for storing and 
managing their data, increasingly faced the need for stor- 
ing large-scale diversity data efficiently, several of them 
collaborated on developing a schema module for Chado 
capable of recording a wide variety of phenotyping and 
genotyping experiments in a way that maintains links to 
stocks and germplasms. 

An initial version of a natural diversity module for 
Chado was developed at the National Evolutionary 
Synthesis Center (NESCent) in collaboration with W. Owen 
McMillan, who was a Center fellow at the time. McMillan is 
part of a community of researchers who use neotropical 
butterflies of the genus Heliconius as an emerging model 
system to study evolutionary genomics of Mullerian 
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mimicry and adaptation (22,23). The data collected in this 
context resemble those of a clade-oriented database, and 
although not previously managed in GMOD, the need to 
connect the results of genetic and phenotypic diversity ex- 
periments and a growing body of molecular data strongly 
motivated modifying the data model as a future Chado 
Natural Diversity (ND) module. Later, several model organ- 
ism (Medicago), clade-oriented (SGN, GDR, VectorBase) and 
plant-breeding (KnowPulse) databases formed a working 
group to collaboratively take up further development of 
the ND module. The goal of the working group was to 
mature the module to become an official part of Chado, 
fully consistent with its design principles and requirements, 
and as part of that expand its capacity for storing data from 
multiple experiments of specimens that were collected, 
treated and evaluated in various locations, environments 
and time points. 

As a result of these revisions, the module now allows the 
storage of data from each experimental line that are scored 
for a large number of phenotypic traits, and genotyped 
with a set of genetic markers. In addition to storing data 
from experiments performed on existing lines, experiments 
that generate new lines and experimental samples, such as 
field collections, crosses and treatments, can be stored. In 
the remainder of the manuscript, the design of the ND 
schema and the use-cases from several of the databases 
participating in the working group are described. 

Schema design approach 

The development and maturation of the ND schema fol- 
lowed several guidelines imposed by best practices for 
Chado module design (flexibility, ontology-driven metadata, 
reuse of existing modules), and motivated by naming of 
schema elements that is intuitive yet remains unencum- 
bered by potential name clashes with other modules. The 
following paragraphs present details on each of these. 

Flexibility 

As the ND module targets any type of experiment related 
to the collection of phenotypes and genotypes, the most 
important feature was to have a generic place for storing 
various assays. The ND schema was designed in a way that 
does not restrict the type of specimen that can be stored in 
the stock table and allows linking the specimen to any ex- 
periment via the nd_experiment_stock table. The 'experi- 
ment' name space does not necessarily refer to a complete 
biological experiment, but rather to an entity on the data- 
base level, sharing common attributes. For example, every 
time a phenotype or genotype is scored on a single stock, a 
new row is stored in the nd_experiment table. Likewise, an 
experiment can yield a new stock, thus the nd_experiment 
table is simply a placeholder allowing many-to-many rela- 
tionships between stocks and experiments. 



The nd_experiment table defines the type of experiment 
using a foreign key to the controlled vocabulary table 
(cvterm), eliminating the need for multiple tables for dif- 
ferent types of experiments or assays, and also providing a 
link to the phenotype and genotype values. 

The types of experiments tested during development of 
this module are: 

(i) phenotyping: any experiment involving scoring the 
subject for a trait value; 

(ii) genotyping: any experiment yielding a value of the 
genetic makeup of the subject; 

(iii) cross or mutation experiment: an assay involving 
crossing two subjects or mutation to generate new 
stocks; and 

(iv) field collection: collection of data or specimens in the 
field for further analysis in the lab. 

The generic design of the nd_experiment table, and op- 
tional many-to-many relationship between experiments 
and stocks, is flexible enough to allow other experiment 
types with the ND module. 

Ontology-driven metadata 

The ND module uses controlled-vocabulary (or ontology) 
terms for metadata rather than hard-coded column 
names whenever sensible. Metadata vocabulary terms are 
stored in the cvterm table, thus keeping domain entities 
unified and reusable. For example, if a stock is of type 
'plant accession', this type is stored as a cvterm, which can 
then be used for other stocks of the same type. In the ND 
schema, the types of experiments and reagents are also 
modeled as links to the respective cvterms. The same ap- 
proach is applied to all metadata properties of relational 
entities, for example dates or comments. Metadata proper- 
ties are stored in property association tables, which are con- 
ventionally named by appending the suffix 'prop' to the 
name of the entity table for which it stores properties. 
For example, the table nd_experimentprop stores meta- 
data properties for rows in nd_experiment. 

Reusing tables from existing Chado modules 

The ND modules reuse tables from the following modules, 
rather than redefining them: the stock module for stocks/ 
specimens, the controlled vocabulary module for using 
ontology terms, and the genetic and phenotype modules 
for storing the relevant scores. The project table and the 
newly added project_relationship table are used to group 
similar experiments. The contact and publication modules 
can be used to record contact and reference information 
for experiments. 

ND prefix for the tables in the new module 

The ND module introduces 'nd_' as a prefix for table names 
to allow intuitive naming of tables (and related relational 
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entities such as primary key sequences) whereas avoiding 
easily conceivable collisions with tables in other modules. 
For example, the central table of ND is 'experiment', which 
as a term is common to many areas of experimental biology 
but has sufficiently different semantics depending on the 
biological application that using one and the same table 
across all Chado modules that refer to an experiment 
would be prone to confusion. Trying to choose a distinct 
name can easily suggest inaccurate semantics although not 
necessarily avoiding the name collision problem; for ex- 
ample in earlier versions the nd_experiment table was 
named 'assay', which is less accurate yet still conflicts with 
the Chado MAGE module for microarray data. As the 
Chado database schema expands to more kinds of biologic- 
al data, prefixing the names of tables and primary keys may 
become a new recommended practice for Chado modules. 



Schema description 



The ND module consists of tables, which are used to store 
data from various experiments performed on biological 
entities and also track the relationships of experiments 
with data stored in other modules. Other key modules 
that are integral to the ND module include the Stock, 
Phenotype and Genetic modules (Figure 1). For current 
status of the ND schema see the GMOD wiki module page 
(http://gmod.org/wiki/Chado_Natural_Diversity_Module). 



Latest version of the Chado schema can also be found on 
the GMOD website (http://gmod.org/wiki/Downloads). 

Natural diversity module 

Experiments and their relationship with other 
data. The nd_experiment table is central to the ND 
module. The notion of experiment in this module is concep- 
tual, and may not reflect a whole biological experiment. 
Each nd_experiment has a type (common types are pheno- 
typing, genotyping, field collections, crossing, mutation 
and propagation), and a location. As a generic guideline 
whenever a person collects data at a certain date and loca- 
tion, a new nd_experiment row is stored. The existing 
Chado project table is used for storing the top-level experi- 
mental design, and nd_experiment_project table for group- 
ing nd_experiments and associating those with the project 
(Figure 2). 

Another important feature of experiments in ND is 
the ability to maintain many-to-many relationships be- 
tween each nd_experiment and stocks using the nd_exper- 
iment_stock linking table. 

One stock entry can be linked to multiple experiments 
through the nd_experiment_stock table. For phenotyping 
and genotyping experiments, many times these will be 
linked with a specific protocol. In such cases it is recom- 
mended that each row in the nd_experiment be linked to 
a single row of stock table and to a single row of genotype 
or phenotype table. Exceptions to this would include where 
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Figure 1. Core tables in the Natural Diversity module, and directly linked tables from other Chado modules. Databases are the 
five member organizations of the Natural Diversity development working group, and the APIs are some of the solutions 
currently used for interacting with the Chado schema. 
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a row of the nd_experiment is linked to multiple rows of 
the genotype table to store the genotype of a heterozy- 
gote. In this case, the genotype of each allele of the het- 
erozygote is stored in a distinct row of the genotype table. 
However, users can choose to store the genotype of a het- 
erozygote in one row of the genotype table. Multiple 
phenotypes can be linked to a single nd_experiment row 
when there is no need to store a specific protocol, for ex- 
ample when measuring lengths of several plant parts. 

A row of nd_experiment of type 'genotyping' or 'pheno- 
typing' represents the particular experiment performed on 
a specific sample of a biological entity. The associated 
metadata of an experiment such as the experimenter, 
date and comments can be stored in the nd_experiment- 
prop table. When a stock was subjected to any biotic or 
abiotic treatment prior to phenotypic evaluation, it can 
be linked to an experiment of type 'sample_treatment' 
and the metadata and the protocols of the sample treat- 
ment experiment can be stored in nd_experimentprop, 
protocol and protocolprop (see section 1.2). Since one 
row of the experiment of type 'genotyping' or 'pheno- 
typing' describes an assay that uses a single stock_id to 
give rise to a single genotype or phenotype, multiple 
nd_experiments that relate to each other can be linked to 
the same record in the 'project' table. A project can consist 



of several smaller sub-projects and the relationships among 
projects can be stored in the project_relationship table. The 
nd_geolocation table stores the details of the geolocation 
where the experiments were carried out, such as the geo- 
location of a field where plants were grown. 

Protocols, Reagents and their relationship with 
other data. The nd_protocol table stores any procedure 
performed as part of an experiment. Specific reagents used 
in a protocol can be stored in the nd_reagent table. 
Multiple rows of the nd_experiment table can be linked 
to a single row of the nd_protocol table, and multiple 
rows of the nd_protocol table can be linked to a single 
row of the nd_reagent table (Figure 2). 

Stock module 

The original stock module was designed to store informa- 
tion about stock collections in a laboratory. This original 
concept of 'stock' has been expanded to accommodate 
conceptual entities that a stock belongs to or entities 
that were derived from a stock for specific experiments. 
Hence, the stock table can store hierarchical entities of 
population, strain/line/accession, individual, clone and 
sample, with relationships between stocks defined in the 
stock_relationship table. For example, a stock could be a 



Page 5 of 13 



Database tool 



Database, Vol. 2011, Article ID bar051, doi:10.1093/database/bar051 



population maintained in a bottle in the lab composed of a 
mixture of multiple genotypes, or could be an inbred line of 
a plant whose genotype is fixed. In a phenotyping experi- 
ment of plants in field plots, with 10 plants for each geno- 
type, one would want to store each plant separately, since 
the phenotype is recorded on the individual plants. This 
method allows keeping track of the seeds of the progeny, 
collected from each plant for future experiments based on 
analysis results. In this case, each plant may be stored as a 
stock, with the parent line indicated in stock_relationship 
(e.g. objectjd is the parent accession, subject_id is the 
plant in the field plot, with type_id of # plot_of). 

Since a population can be defined as a group of entities, 
a population entity can be composed of multiple species 
such as a group of insects collected in a field. To accommo- 
date this concept, the 'not null' constraint for organism_id 
of the stock table has been dropped. Any metadata of a 
stock can be stored in # stockprop' table (http://gmod.org/ 
wiki/Chado_Stock_Module). 

Phenotype module 

The phenotype table is linked to the ND module via the 
nd_experiment_phenotype table. The descriptors of pheno- 
types can be stored using observable_id and/or attr_id 
of the phenotype table, and the specific phenotypic value 
of a biological entity from a specific experiment is stored 
in phenotype.value or phenotype. cvalue_id. The detail of 
the descriptors (observable_id and attr_id) can be stored in 
the cvterm table (http://gmod.org/wiki/Chado_Phenotype_ 
Module). 

The phenotype module was designed and used by 
FlyBase. Although the ND working group designed the ND 
module, a few basic issues concerning the phenotype table 
were raised. Currently, the phenotype table holds both the 
phenotype value, and potentially a post-composed cvterm 
[e.g. observable_id = # fruit_id' from the Plant Ontology (PO), 
attr_id = 'color' from the Phenotype And Trait Ontology 
(PATO)]. This design does not allow reusing post-composed 
terms, and does not separate the phenotype value from the 
descriptor. Since this problem does not have one agreed 
upon solution, and it is beyond the scope of the ND 
module, a new revision of the Chado phenotype module is 
required. Since revisiting the phenotype module may cause 
backward compatibility issues for existing users, any changes 
require careful documentation and migration scripts for 
updating the schema. Moreover, schema changes should 
deprecate, rather than delete, existing tables or columns. 

Genetic module 

The genotype table is linked to the ND module via the 
nd_experiment_genotype table. For large-scale genotyping 
data using molecular markers, the allelic variation can be 
stored in genotype. description and the markers that are 
stored in the feature table can be linked to the genotype 



table via the feature_genotype table (see the Chado gen- 
etic module http://gmod.org/wiki/Chado_Genetic_Module). 

Use cases (example data 
representations) 

The use cases given below are some examples of how vari- 
ous biological data can be stored in the ND module and 
other related Chado modules. Although the examples 
below of ND 'best practices' are the results of numerous 
discussions, meetings and different database and user re- 
quirements, the system provides flexibility, allowing the in- 
clusion of other data storage approaches. 

ND usage relies heavily on the stock module. In each of 
the following use cases, samples or accessions are stored in 
the stock module, and phenotype/genotype experimental 
results are stored in the ND module. The unique two-way 
linking table, nd_experiment_stock, allows storing experi- 
ments of existing stocks, as well as generating new stocks 
from various experiments, such as field collections. 

The various queries below simply illustrate how the data 
are stored but do not necessarily represent the best way to 
query the data. Materialized views can be created for faster 
query performance. 

Plant breeding data (Genome Database for Rosaceae) 

Genome Database for Rosaceae (GDR; http://www. rosaceae 
.org/) contains genetic and genomic data of the Rosaceae 
family (almond, apple, apricot, blackberry, cherry, peach, 
pear, plum, raspberry, rose and strawberry). Currently, 
GDR also contains private breeding data from individual 
breeders and public breeding data from the RosBREED pro- 
ject (15). In fruit tree breeding programs, progeny of tree 
fruit breeding crosses are used for genotype and pheno- 
type evaluations. The individuals of progeny are clonally 
propagated and planted in multiple locations so that 
breeders can obtain adequate quantity for further evalu- 
ations. Phenotype evaluations are done multiple times on 
the same clones. Samples can be collected from the same 
clone multiple times and the subsets of samples that are 
collected at the same time can be treated in different con- 
ditions before phenotypic evaluation. Genotyping can also 
be done multiple times on the same variety. 

Pedigree and cross data. The pedigree data of vari- 
eties are stored using the stock_relationship table. The ex- 
ample query below shows the pedigree information of the 
variety 'T1119'. 

SELECT sl.uniquename AS subject, c.name AS relation- 
ship, s2.uniquename AS object 
FROM stock s1 

JOIN stock_relationship sr ON s1 .stock_id = sr.subject_id 
JOIN stock s2 ON sr.object_id = s2.stock_id 
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JOIN cvterm c ON sr.type_id = c.cvterm_id 
WHERE s2.uniquename = T1119 
OR s1.uniquename = T1 1 19 



Subject Relationship Object 



Splendour 


is_ 


_a_ 


_maternal_parent_of 


T1119 


Gala 


is_ 


_a_ 


_paternal_parent_of 


T1119 


T1119 


is_ 


_a_ 


_maternal_parent_of 


C1111 


T1119 


is_ 


_a_ 


_paternal_parent_of 


C1112 


T1119 


is_ 


_a_ 


_maternal_parent_of 


C1113 


T1119 


Is. 


_a_ 


_mutation_parent_of 


C1114 



Crossing experiments. Crosses between accessions are 
stored as an nd_experiment with type of 'cross experiment', 
and linked to the parent accessions via the nd_experi- 
ment_stock. Derived progeny can then be stored in the 
stock and stock_relationship tables. This practice allows 
keeping track of pedigrees and progeny accessions used 
in subsequent experiments. 

Phenotype evaluation. In fruit tree breeding, a variety 
is usually propagated to produce multiple clones and the 
phenotype of each clone is evaluated multiple times. Each 
sample that gives rise to one phenotypic value is associated 
with one nd_experiment. The sample can be stored in the 
stock table and the relationships among varieties, clones 
and samples can be stored in the stock_relationship table. 
The trait descriptors developed by individual breeders are 
stored in the cvterm table, and linked from pheno- 
type. attr_id. Currently, Solanaceae breeders' ontology 
exists, and Rosaceae breeders also developed standard 
trait descriptors as part of RosBreed project, which are 
also stored in the cvterm table. Phenotyping data from in- 
dividual breeders can be stored using standard descriptors 
as well as their individual descriptors, so that breeders can 
choose to view the data either way. When breeders want to 
compare their results with those from other breeders, they 
can choose to view the data in standard descriptors. 
Metadata of the samples, such as harvest dates, are 
stored as properties of the stock (stockprop table). 

An example query below shows all the phenotypes that 
are associated with a sample. If we want to query all the 
phenotype data for a specific variety, we can query all the 
phenotypic values of all the samples of a specific variety. 

SELECT s.uniquename AS sample, cl.name AS pheno- 
type, p.value AS value 
FROM stock s 

JOIN nd_experiment_stock es ON s.stock_id = es.stock_id 

JOIN nd_experiment_phenotype ep 

ON es.nd_experiment_id = ep.nd_experiment_id 

JOIN phenotype p ON ep.phenotype_id = p.phenotype_id 

JOIN cvterm d ON p.attr_id = c1 .cvtermjd 



WHERE s.uniquename = T1120 



Sample 


Phenotype 


Value 


T1120 


SWEET 


3 


T1120 


OVERALL 


1 


T1120 


GRDCOL 


1.5 


T1120 


CRISP 


2 


T1120 


JUIC 


3 


T1120 


HUE 


5.50 



QTL and association mapping data (SGN) 

The Sol Genomics Network (SGN; http://solgenomics.net) 
database hosts an extensive phenotype and genotype data- 
base of Solanaceae plant accessions (4,24). Much of these 
data originate from breeding experiments of important 
crops of the family, mainly tomato and potato. One of 
the major goals is to collect the phenotypes and genotypes 
in standard formats for linking phenotypic variation with 
genetic markers, and eventually identify the underlying 
genotypes. 

A problem with acquiring breeding data from multiple 
resources is how to represent phenotypic scores of germ- 
plasm planted and collected by different people at differ- 
ent locations, dates and environments. The Chado ND 
module can address these issues. 

Phenotyping field experiments. Usually plant- 
breeding data is collected for single plants, situated in a 
field or greenhouse, with a predetermined experimental 
design of planting pattern to enable statistical analysis 
whereas eliminating potential interfering factors, such as 
location in the field. Each assayed plant is stored in the 
stock table, and stock_relationship holds the link to the 
parent accessions. Then each field experiment is stored in 
the nd_project table for grouping all subsequent phenotyp- 
ing and genotyping experiments on each individual plant in 
the field plot. 

The table nd_experiment is used as a placeholder for one 
phenotyping event or assay performed on one plant stock. 
Usually if several phenotypic scores are measured for one 
plant by one person (or computer software) on the same 
date and location, then one row is stored in nd_experiment 
and nd_experiment_stock, and each phenotype is recorded 
separately in the phenotype table. All the phenotypes are 
linked to the experiment via the nd_experiment_pheno- 
type table. 

Phenotypes may be quantitative or qualitative. The 
Chado phenotype module does not distinguish between 
the two modes, and the phenotypic value is stored in the 
phenotype.value field as a character, whether the value is 
numeric continuous, ordinal or a string descriptor. The 
phenotype is stored as Entity-Quality (EQ), and value 
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whenever applicable. Entity is the observable, and usually a 
Solanaceae Phenotype Ontology term (SP) is used 
(http://solgenomics.net/search/direct_search.pl7search = 
trait), which is a breeder-focused vocabulary developed 
according to user requirements (e.g. Plant habit index 
SP:0000128), essentially created as pre-composed terms 
taken from the Plant Ontology and PATO. The quality is 
the attribute, stored as a Phenotype and Trait Ontology 
term (PATO), or as a Solanaceae Phenotype term, which is 
mapped to a PATO term (e.g. SP:0000196 'spreading' is 
mapped to PATO:0001855 'horizontal'). This SP to PATO 
mapping allows SGN to give breeders direct access to 
trait descriptors, eliminating the need to use multiple voca- 
bularies, as well as using very granular and Solanaceae spe- 
cific trait values (e.g. SP:0000332 'spreading habit 30-15 
degrees'). 

Although statistical analysis of quantitative phenotypes 
is required for QTL detection and mapping, there is a need 
to filter the phenotypes linked to nd_experiments, and sub- 
sequently to stocks, such that only quantitative numeric 
phenotypic values are included. 

A sample query and results for finding the average 
phenotypes by quality is given below for retrieving the 
traits of the processing tomato accession 'Saucy'. Since 
phenotypes are not scored directly on the stock for the ac- 
cession, but rather on individual plants as observational 
units ('subjects') the query finds all the phenotypes of the 
accession's subjects. Further filtering is possible by project, 
especially if different plants were scored at different loca- 
tions of different environmental effects. 

SELECT attr.name AS attribute, avg(cast (phenotype 
.value as real)) 

FROM stock JOIN stock_relationship ON stock_id 
= subject_id 

JOIN nd_experiment_stock USING (stock_id) 

JOIN nd_experiment USING (nd_experiment_id) 

JOIN nd_experiment_phenotype USING (nd_experiment 

_id) 

JOIN phenotype USING (phenotype_id) 

JOIN cvterm AS attr ON attr_id = attr.cvtermjd 

WHERE object_id = (SELECT stockjd FROM stock WHERE 

name = 'Saucy') 

GROUP BY attribute; 



Attribute 


Average 


Horizontal asymmetry ovoid 


3.66 


Heart shape 


0.55 


Proximal fruit end indentation 


0.03 


Cross section average chroma 


36.89 


Cross section average 'b' value 


27.98 


Cross section average L value 


37.49 


Proximal angle macro 20% 


92.88 



Genotyping data. The individual plant samples that 
are phenotyped are also often genotyped. The genotype 
scores for each marker are stored in the nd_experiment 
table with a type_id of 'genotyping experiment', which 
are linked to the Chado genetic module via nd_ 
experiment_genotype. 

In QTL studies, phenotypes are correlated with differ- 
ences in genotypes (genetic markers showing polymorph- 
isms). The ND schema design allows simple querying of 
the data required for running statistical methods for find- 
ing traits which co-occur with a marker or a location on the 
genetic map. Currently, a module exists that performs QTL 
analysis for F2 and Back-cross bi-parental plant populations 
with continuous phenotypic variation (25). 

For association analysis of populations of unknown struc- 
ture, similar phenotype and genotype experimental data 
can be used. Although more elaborated statistical methods 
are required for association-mapping, all the necessary data 
for this analysis can be easily queried in a manner similar to 
the QTL tool. Existing tools, such as TASSEL (26), can then 
be applied to the data, and new statistical tools, for ex- 
ample, based on R (http://www.r-project.org/), will be 
implemented. 

Phenotypic assay of wild-caught samples (VectorBase) 

A large proportion of the data in VectorBase (http://www 
.vectorbase.org/) (27) concerns insecticide resistance. The 
process of assessment is generally three-fold. Mosquitoes 
are collected from the field, their species is identified and 
then they are subjected to the insecticide resistance assay. 
The ND schema is well suited to capturing all of these 
aspects. 

Each of these actions will be recorded as a separate entry 
in the nd_experiment table, such that a typical sample will 
be connected to one field collection assay, one species iden- 
tification, and one phenotypic assay. 

Field collection metadata. In addition to the field col- 
lection information, it is important to record the catch 
method and other metadata since it can significantly 
affect the types of samples found or have implications for 
the phenotype itself (e.g. a collection performed inside a 
dwelling strongly suggests an anthropophilic mosquito). 
This is stored in the nd_protocol / nd_protocolprop tables 
using terms from one of the vector biology-specific ontol- 
ogies such as MIRO (28) (e.g. MIRO:30000009 - 'indoor light 
trap catch'). 

The site of the field collection is stored in the nd_geolo- 
cation and nd_geolocationprop tables. The level of detail is 
left up to the individual submitters, but the nd_geolocation 
table expects a latitude and longitude, along with the geo- 
detic datum (e.g. 'WGS 84') and a gazetteer ID is a required 
field in VectoBase to specify location names. 
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Species Identification. Determining which of the mor- 
phologically identical species have been captured is most 
often achieved by comparing species-specific ribosomal 
DNA sequence, and would be recorded in the nd_proto- 
col/nd_protocolprop table in a similar manner to the field 
collection method (e.g. M I RO: 3 0000040 - 'PCR-based spe- 
cies identification'). Species can therefore be recorded in 
two different places, either in the nd_experiment_gen- 
otype/genotype tables, or directly in the stock/organism 
tables. 

In cases where a species identification assay has been 
provided the genotype tables should be used to record 
the direct result of this assay and an NCBI taxon ID if 
known. Where a stock species is known - regardless of 
whether we have assay data for this or not - this should 
be recorded in the stock table using the relevant NCBI 
taxon ID. It is worth noting that, although the organism 
field can be left null, it is informative - that the 5pp. has 
not been determined unambiguously. 

Phenotypic assay. Wild-caught mosquitoes are assayed 
for resistance to insecticides using a number of different 
methods. Many of these are standardised by the WHO 
(World Health Organization) and as such would be linked 
to the same records in the protocol table. Factors that 
change from assay to assay, such as the precise number of 
mosquitoes used or any varying insecticide concentrations 
would be recorded in the nd_experimentprop table. 

Though these assays may use different metrics, they are 
all an attempt to measure the same phenotype: the relative 
level of tolerance to the insecticide. As a result VectorBase 
stores this information both as numerical measurements 
and also as linked Entity-Quality (E-Q) values using dedi- 
cated vector biology ontologies [E] and PATO [Q]. In this 
way two methods may use different metrics (e.g. MIRO 
:20000013 Time Response Test may be measured in seconds, 
and MIRO:20000076 Dose Response Test may be measured 
in % mortality) yet the resultant phenotype for both could 
be the same (e.g MIRO:00000003/PATO:0001 650— metabol- 
ic resistance/increased resistance). 

Large-scale phenotyping and genotyping data 
(Medicago ecological genomics data) 

The model legume Medicago truncatula is an emerging 
system for ecological genomics and association mapping, 
since the genomes of at least 400 inbred lines will be 
re-sequenced and polymorphism data will be linked with 
phenotypic variation. Data are currently available for 40 
inbred genotypes that have been grown under a range of 
environmental conditions and phenotyped for the same 
suite of traits that have been formalized into a custom 
ontology. Treatments include different field sites, manipu- 
lated NaCI concentrations in the greenhouse, and varying 



NaCI environments experienced by the maternal plant of 
an individual seed. 

The project table is used for grouping a set of experi- 
ments, and each of those set consist of several assays per- 
formed on one individual. The information about each 
individual is stored in the stock table with a type_id of 
'inbred line'. Properties of the individual, such as 'plant 
id', 'location in holder', and 'soil replaced date', are 
stored in the stockprop table. The nd_experiment is 
linked to the stock via the nd_experiment_stock table. 
Genotyping and phenotyping assays are stored in nd_ex- 
periments with links to the genotype and phenotype 
tables respectively. 

The type of treatment applied to the individual stock 
is stored in the nd_protocol table using the type_id field 
(e.g. 'NaCI treatment'). The reagent 'NaCI' is stored in the 
nd_reagent table, and the treatment is linked to nd_experi- 
ment with a type_id of 'treatment' via the nd_experiment_ 
protocol table. 

Phenotype descriptions are stored in the Phenotype 
table. An example of a phenotype description is 'stem 
diameter at harvest (mm)'. This description is decomposed 
into an entity-quality term 'stem diameter', a temporal 
modifier 'at harvest', and unit 'mm'. The EQ term is defined 
in our Diversity Experiment Ontology (DEO). The term id is 
stored in the observable_id field, and the unit 'mm' is 
defined in the Unit Ontology (UO). The modifier 'at harvest' 
is composed of two subterms, a relation 'at' (or its synonym 
'exists_at') and the temporal quality 'harvest'. Both of these 
terms are defined in the DEO. 

Each phenotyping assay (nd_experiment) is linked to 
a phenotype via the nd_experiment_phenotype table. 
Statements linking the environment, genotype, and pheno- 
type, are stored in the phenstatement table. For example, a 
statement of this type would be "The mean of the pheno- 
type flower number in genotype TN7.4 given an environ- 
ment of NaCI treatment of 1 00 millimolar is 1 0". A row with 
uniquename of 'high salt' is stored in the environment 
table. A row with uniquename set to the project name 
and type_id of 'experimental result' is stored in the pub 
table. 

Genotype Evaluation (KnowPulse; http://knowpulse2 
.usask.ca/portal/) 

Genotypic assays are usually performed with DNA from a 
given individual. Often DNA is extracted mutliple times 
from a given individual and then multiple genotypic 
assays may be performed on each DNA or cDNA sample. 
Both the individual and each DNA sample from that indi- 
vidual are stored in the stock table and the samples are 
related back to the individual from which they were 
taken using the stock_relationship table. 

The basic goal of a genotypic assay is to identify the 
genotype of a given individual at a given genetic locus 
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where a genetic locus is defined as a position in the genome 
of the organism. Genetic loci are stored in the feature table 
(See Chado Sequence module). A list of genetic loci avail- 
able for screening for a given individual (Sheyenne) can be 
obtained by the following example query. 

SELECT f.name, f.uniquename, fl.fmin as position_start, 
fl.f max as position_end, srcf.name as feature_located_on 
FROM feature f 

JOIN cvterm t ON t.cvterm_id =f.type_id 

JOIN featureloc fl ON fl.feature_id = f.feature_id 

JOIN feature srcf ON srcf.feature_id = fl.srcfeature_id 

JOIN stock s ON s.organism_id = f.organism_id 

WHERE s.name = 'Sheyenne' 

AND t.name = , genetic_marker'; 

The genotype observed for a given loci is stored in 
the genotype table (See Chado Genetic module) with the 
description of the genotype observed, whether that be a 
sequence, presence/absence or some other standard de- 
scription. The locus for which the genotype was identified 
is linked to the genotype through the feature_genotype 
linking table. An example query listing all the observed 
genotypes for a given locus is shown below. 

SELECT g.name, g.uniquename, g. description as 

genotype 

FROM genotype g 

JOIN feature_genotype fg ON fg.genotype_id = 
g.genotype_id 

JOIN feature f ON f.feature_id = fg.feature_id 
WHERE f.uniquename = 'SUKHA-1'; 

The sample stock is linked to the observed genotype 
through the nd_experiment table since a genotypic assay 
was performed in order to determine what the observed 
genotype for a given sample is. Thus, as with phenoypic 
data, a single experiment in the nd_experiment table iden- 
tifies the genotype of a single genetic locus in a single 
sample. The protocols of the genotypic assay can be 
stored in nd_experiment and other associated tables (See 
use cases of Medicago ecological genomics data). The 
sample is linked to the experiment through the nd_exper- 
iment_stock linking table and the observed genotype is 
linked to the experiment through the nd_experiment_gen- 
otype linking table. An example query is shown below 
which lists all of the stocks that have been assayed at a 
given locus including the genotype observed and the ex- 
periment that identified the genotype. 

SELECT geo. description as location, expt.name as experi- 
ment_type, s.name as stock_name, s.uniquename as 
stock_uniquename, g. description as genotype 
FROM nd_experiment exp 



JOIN nd_geolocation geo ON geo.nd_geolocation_id 
= exp.nd_geolocation_id 

JOIN cvterm expt ON expt.cvterm_id = exp.type_id 

JOIN nd_experiment_stock exps ON exps.nd_experiment 

_id = exp.nd_experiment_id 

JOIN stock s ON s.stock_id = exps.stock_id 

JOIN nd_experiment_genotype expg ON expg.nd 

_experiment_id = exp.nd_experiment_id 

JOIN genotype g ON g.genotype_id = expg.genotype_id 

JOIN feature_genotype fg ON fg.genotype_id 

= g.genotype_id 

JOIN feature f ON f.feature_id = fg.feature_id 
WHERE f.name = 'SUKHA-1'; 

Application Programming 
Interfaces 

The ND module provides a generic schema for storing ex- 
perimental results. Most resources using this schema will 
require front-end applications with graphical interfaces 
for their users and possibly for data curation purposes. 
Application Programming Interface (API) development is 
independent from the schema, as long as the software 
knows how to connect, query, and fetch results from the 
database back-end. The following list, not meant to be ex- 
haustive, describes some examples of the ND schema APIs 
available for different programming languages. 

(1) Bio::Chado::Schema (http://gmod.org/wiki/Bio::Chado 
::Schema) 

Bio::Chado::Schema is a GMOD project that provides 
a standard object-relational mapping (ORM) layer in Perl 
for the Chado schema. It is implemented using DBIx::Class 
(http://www.dbix-class.org/) and provides support for all 
Chado modules, including the ND module. 

(2) Grails' Object Relational Mapping (GORM) and 
Hibernate 

Hibernate is a Java ORM. The Hibernate Tools can be 
used in Eclipse or Ant to reverse engineer the Chado 
schema into Hibernate domain classes. GORM is part of 
the Groovy/Grails web framework. It can use the 
Hibernate classes to provide simple findByAttribute 
queries, criteria queries involving more than two attributes, 
and queries via the Hibernate Query Language (HQL). 

(3) Drupal and Tripal (http://gmod.org/wiki/Tripal) 

Drupal is a content management system that provides 
various practical benefits as a front-end for Chado. Tripal 
(29) is a GMOD project that serves as a web front end based 
on a collection of Drupal modules. The current version sup- 
ports various visualizations for various Chado modules and 
further development is underway to support ND module. 
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Discussion 

The ND Module was developed to store large-scale pheno- 
typing, genotyping and breeding data. Even though the 
existing Chado modules allow storing the genotype and 
phenotype of stocks, it cannot accommodate multiple 
genotypic and phenotypic values of stocks and experimen- 
tal metadata that are generated by large-scale experi- 
ments. To meet these needs, the Chado ND Module 
Working Group was formed, which consists of database de- 
velopers interested in using Chado to store their large-scale 
phenotyping and genotyping data, to collaboratively de- 
velop a new module. Since the working group consists of 
developers dealing with diverse organisms, projects and ex- 
perimental designs, the module was naturally conceived 
with flexibility in mind. 

The starting point was a ND module created in 2007 for 
managing genetic and phenotypic diversity data collected 
for Heliconius butterflies, an emerging model system for 
evolutionary genomics of mimicry and adaptation (22,23). 
Subsequent work focused on maturing the ND module into 
an officially accepted Chado module fully consistent with 
Chado's design principles and requirements (3). This 
included changes so that it can accommodate a wider var- 
iety of genotyping and phenotyping experiments as well as 
new types of stock with their associated phenotypes and 
genotypes. A key concept in the module's current design is 
to use the nd_experiment table for any kind of experiment 
or assay that uses or produces stocks. A single row of nd_ex- 
periment is created to link one stock to one distinct pheno- 
typic or genotypic value, stored in the phenotype or 
genotype table. Another key concept is the capacity to 
store biological entities of any hierarchical level in the 
stock table, and to store groups of experiments of any hier- 
archical level in the project table, just as features in the 
genome of any hierarchical level can be stored in the fea- 
ture table. This design gives flexibility to accommodate 
phenotyping and genotyping data from diverse types of 
projects. 

One of the main purposes developing open-source gen- 
eric schemata for biological data is to promote sharing of 
software that work with the schema, as well as to provide 
generic schema to develop new databases (30,31). 
However, due to the flexibility of the schema, there are 
many different ways to store large-scale phenotyping and 
genotyping data, which may present an obstacle to effi- 
cient sharing and reusing of software. To provide future 
database developers with some guideline and examples, 
this article includes various use cases adopted by diverse 
databases. An important future effort to promote the 
standardization of the usage of the schema includes de- 
veloping ontology to store various types of stock and 
their relationships. 



Although the level of normalization of the ND schema 
allows storing data in the magnitude of millions of entries, 
database performance is still an issue yet to be addressed, 
especially with very large data sets, such as SNP genotyping. 
Other database management systems for genotype/pheno- 
type data, such as the GDPDM schema (19), should have 
similar performance issues, thus MOD users and developers 
will have to set methodologies for efficiently storing SNP 
and next-generation sequencing data (32), since all raw 
data should probably not be stored as-is in the database. 

The ND module was developed with plant breeding data 
(Solanaceae, Rosaceae and legume species) and large-scale 
natural diversity data (mosquitoes) in mind. These types of 
data are expected to grow rapidly due to the advance of 
high-throughput technology and the increasing effort 
toward marker-assisted breeding. The participating data- 
bases include GDR (5), SGN (4) and KnowPulse (http://know- 
pulse2.usask.ca/portal/). The collaboration was done mainly 
through conference calls and email correspondences. The 
development process was very efficient and productive; 
presenting this type of collaboration as a good model for 
further development of open-source bioinformatics tools. 

Software development and database management 
are the responsibility of each participating database. 
Although there is no one standard API for the ND module 
(and Chado in general), or even a standard programming 
language or Database Management System (DBMS), there 
are several software packages and modules that were 
written to interact with Chado, yet these are distributed 
separately. Some examples are Bio::Chado::Schema and 
Bio::DB::Das::Chado, both available from CPAN (http:// 
search/cpan.org) and Modware, available at Sourceforge 
(http://sourceforge.net/projects/gmod-ware/). The ND 
module is an integrated part of GMOD and the Chado 
schema, and can be downloaded and updated from SVN 
(http://gmod.Org/wiki/Chado#Chado_From_SVN), or down- 
loaded as part of the periodically updated Chado schema 
stable release (http://gmod.org/wiki/Downloads). Since 
Chado is a part of the GMOD open-source project (http:// 
gmod.org), it relies heavily on the user and developer com- 
munities to test, use, fix and contribute to both schema and 
code. See the GMOD web site for ways of obtaining help, or 
join the Natural Diversity mailing list (https://lists.sou rce 
forge.net/lists/listinfo/gmod-phendiver). 

ND module is expected to be revised and improved as 
more users may display other needs for data storage and 
analysis. As with any other open-source project, the success 
of this collaboration relies on future maintenance and 
software development by GMOD users. The participating 
groups in this project are currently writing loading scripts, 
APIs and documentation for ND module 'best practices'. 
Developing open-source code, which interacts with the 
schema (e.g. the code behind SGN is freely available from 
github https://github.com/solgenomics; 4), will make this 
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project more sustainable in the long run, as more users 
show interest in a platform for phenotype to genotype 
analyses. 
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